This R Markdown document holds the Fake Data Analysis assignment, which analyzes fake data on NBA game attendance. The data contain seven predictors: weekday, game time, conference, national TV status, home team win percentage, opponent win percentage, and the opponent's number of all-stars. The assignment takes a closer look at the effect of each of these variables on attendance.
First, the analysis fits a linear regression model of attendance on weekday, conference, and home team win percentage, then runs an ANOVA on the fit and plots the result.
head(d)
## # A tibble: 6 x 9
## row weekday time conf national_tv opp_win_p opp_all_stars win_p attendance
## <dbl> <chr> <tim> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Sa 14:00 conf… No 40% - 60% 0 0.469 13084
## 2 2 Th 19:00 conf… No More tha… 1 0.495 17591
## 3 3 Fr 17:00 conf… Yes More tha… 1 0.505 17717
## 4 4 M 17:00 conf… Yes Less tha… 0 0.358 11058
## 5 5 Th 19:00 conf… Yes More tha… 0 0.702 15986
## 6 6 Tu 19:00 Riva… Yes 40% - 60% 0 0.511 19656
d %>% slice_sample(n=10)
## # A tibble: 10 x 9
## row weekday time conf national_tv opp_win_p opp_all_stars win_p
## <dbl> <chr> <tim> <chr> <chr> <chr> <dbl> <dbl>
## 1 3426 Tu 19:00 Riva… No More tha… 2 0.592
## 2 3725 Fr 14:00 Riva… Yes Less tha… 1 0.669
## 3 8478 W 14:00 Non-… No More tha… 1 0.674
## 4 2219 Tu 19:00 Non-… Yes Less tha… 0 0.580
## 5 4452 M 19:00 conf… Yes 40% - 60% 2 0.536
## 6 9173 W 19:00 Non-… Yes More tha… 0 0.520
## 7 8825 Th 19:00 conf… Yes Less tha… 0 0.385
## 8 4758 Th 17:00 Non-… No Less tha… 1 0.541
## 9 6370 Sa 19:00 Riva… Yes Less tha… 0 0.280
## 10 5215 Su 19:00 Non-… No 40% - 60% 3 0.484
## # … with 1 more variable: attendance <dbl>
fit1 <- lm(attendance ~ weekday + conf + win_p,d)
anova(fit1,test="Chisq")
## Analysis of Variance Table
##
## Response: attendance
## Df Sum Sq Mean Sq F value Pr(>F)
## weekday 6 5.2772e+09 8.7954e+08 204.575 < 2.2e-16 ***
## conf 2 6.0391e+10 3.0196e+10 7023.336 < 2.2e-16 ***
## win_p 1 2.2149e+08 2.2149e+08 51.517 7.604e-13 ***
## Residuals 9990 4.2950e+10 4.2993e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit1)
##
## Call:
## lm(formula = attendance ~ weekday + conf + win_p, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6199.6 -1505.6 -13.7 1551.0 6219.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14760.81 118.73 124.327 < 2e-16 ***
## weekdayM -916.50 76.62 -11.961 < 2e-16 ***
## weekdaySa 975.40 77.00 12.668 < 2e-16 ***
## weekdaySu -15.18 76.87 -0.197 0.843
## weekdayTh -932.06 77.43 -12.038 < 2e-16 ***
## weekdayTu -1001.57 77.19 -12.975 < 2e-16 ***
## weekdayW -997.69 76.08 -13.113 < 2e-16 ***
## confNon-Conference -3017.39 44.05 -68.495 < 2e-16 ***
## confRivalry 5017.69 73.80 67.986 < 2e-16 ***
## win_p 1470.42 204.87 7.178 7.6e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2073 on 9990 degrees of freedom
## Multiple R-squared: 0.6054, Adjusted R-squared: 0.605
## F-statistic: 1703 on 9 and 9990 DF, p-value: < 2.2e-16
plot(fit1)
boxplot(fit1[['residuals']],main='Boxplot: Residuals',ylab='residual value')
## The median residual is close to 0. The 25th and 75th percentiles are approximately the same distance from 0, and so are the non-outlier minimum and maximum. This symmetry is consistent with a correctly specified model, although it does not prove it.
ggplot(fit1, aes(y = attendance, x = weekday, color = conf)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
confint(fit1)
## 2.5 % 97.5 %
## (Intercept) 14528.0844 14993.5375
## weekdayM -1066.6924 -766.3047
## weekdaySa 824.4759 1126.3290
## weekdaySu -165.8667 135.5036
## weekdayTh -1083.8376 -780.2921
## weekdayTu -1152.8890 -850.2582
## weekdayW -1146.8336 -848.5539
## confNon-Conference -3103.7466 -2931.0424
## confRivalry 4873.0147 5162.3593
## win_p 1068.8469 1872.0010
'ggpredict(fit1,colorAsFactor = TRUE,interactive=TRUE)'
## [1] "ggpredict(fit1,colorAsFactor = TRUE,interactive=TRUE)"
Results:
Intercept: The intercept is the expected attendance for the reference case: a Friday game within the same conference when the home team win percentage is zero. That expected attendance is about 14,761.
SE: The standard error measures the uncertainty of each coefficient estimate, which allows us to construct a marginal confidence interval for that coefficient. The 95% confidence intervals for home team win percentage, both conference levels, and every weekday level except Sunday exclude zero, so these terms have a clear effect on attendance. The Sunday interval (-165.9 to 135.5) contains zero, so Sunday attendance cannot be distinguished from Friday attendance.
R^2: 60.54% of the variation in attendance is explained by weekday, win percentage, and conference.
F-test: Under the null hypothesis that all slope coefficients are zero, the F statistic follows an F distribution with 9 and 9990 degrees of freedom; the observed value is 1703. The p-value, the probability of an F statistic at least this large under the null hypothesis, is smaller than 2.2e-16, so the variables improve the model's fit over an intercept-only model.
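The link between the standard error and the confidence interval can be checked by hand. The sketch below rebuilds the 95% interval for win_p from the rounded estimate and standard error printed by summary(fit1); because the inputs are rounded, it should agree with the confint(fit1) row only to about one decimal place. The variable names are illustrative.

```r
# Values copied from the summary(fit1) output for win_p (rounded as printed).
estimate  <- 1470.42
std_error <- 204.87
resid_df  <- 9990

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit   <- qt(0.975, df = resid_df)
ci_win_p <- estimate + c(-1, 1) * t_crit * std_error

# Compare with the confint(fit1) row for win_p: 1068.8469 to 1872.0010.
print(round(ci_win_p, 1))
```

The same arithmetic applied to the weekdaySu row gives an interval that straddles zero, which is why Sunday is not distinguishable from Friday in this model.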
Second, the analysis adds two variables to the linear regression model, opponent win percentage and whether the game is a national TV game, examines the relationship, and plots the result.
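The code chunk behind this comparison is not echoed in the report. A minimal sketch of how such a nested comparison can be run follows, under loud assumptions: the data frame d_sim, its effect sizes, and the label used for the same-conference level are all hypothetical stand-ins, not the assignment's data, so the numbers it prints will not match the output in this report.

```r
# Hypothetical stand-in for the assignment's data frame `d` (simulated values).
set.seed(1)
n <- 500
d_sim <- data.frame(
  weekday     = sample(c("Fr", "M", "Sa", "Su", "Th", "Tu", "W"), n, replace = TRUE),
  # "Conference" is a placeholder label; the real label is truncated in the printout.
  conf        = sample(c("Conference", "Non-Conference", "Rivalry"), n, replace = TRUE),
  win_p       = runif(n, 0.25, 0.75),
  opp_win_p   = sample(c("40% - 60%", "Less than 40%", "More than 60%"), n, replace = TRUE),
  national_tv = sample(c("No", "Yes"), n, replace = TRUE)
)

# Simulated response with made-up effects for the two added variables.
d_sim$attendance <- 15000 + 1500 * d_sim$win_p +
  2000 * (d_sim$opp_win_p == "More than 60%") -
  2000 * (d_sim$opp_win_p == "Less than 40%") -
  500 * (d_sim$national_tv == "Yes") +
  rnorm(n, sd = 1000)

# Smaller model, then the same model with the two extra terms added.
fit1_sim <- lm(attendance ~ weekday + conf + win_p, data = d_sim)
fit2_sim <- update(fit1_sim, . ~ . + opp_win_p + national_tv)

# Nested comparison: does the larger model reduce the residual sum of squares?
cmp <- anova(fit1_sim, fit2_sim, test = "Chisq")
print(cmp)
summary(fit2_sim)
```

Using update() keeps the second model guaranteed to nest the first, which is what anova() needs for the comparison to be meaningful.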
## Analysis of Variance Table
##
## Model 1: attendance ~ weekday + conf + win_p
## Model 2: attendance ~ weekday + conf + win_p + opp_win_p + national_tv
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 9990 4.2950e+10
## 2 9987 1.7246e+10 3 2.5704e+10 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = attendance ~ weekday + conf + win_p + opp_win_p +
## national_tv, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4404.5 -768.3 7.5 817.3 4124.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15002.58 77.97 192.409 <2e-16 ***
## weekdayM -959.65 48.56 -19.761 <2e-16 ***
## weekdaySa 1038.79 48.80 21.287 <2e-16 ***
## weekdaySu 10.33 48.72 0.212 0.832
## weekdayTh -968.67 49.07 -19.739 <2e-16 ***
## weekdayTu -982.28 48.92 -20.078 <2e-16 ***
## weekdayW -1010.69 48.22 -20.959 <2e-16 ***
## confNon-Conference -2989.13 27.92 -107.056 <2e-16 ***
## confRivalry 4983.93 46.78 106.533 <2e-16 ***
## win_p 1495.70 129.85 11.519 <2e-16 ***
## opp_win_pLess than 40% -2057.49 31.72 -64.873 <2e-16 ***
## opp_win_pMore than 60% 2008.10 31.80 63.141 <2e-16 ***
## national_tvYes -495.55 26.30 -18.846 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1314 on 9987 degrees of freedom
## Multiple R-squared: 0.8415, Adjusted R-squared: 0.8414
## F-statistic: 4420 on 12 and 9987 DF, p-value: < 2.2e-16
Results:
Intercept: The intercept is the expected attendance for the reference case: a Friday game within the same conference when the home team win percentage is zero, the opponent win percentage is between 40% and 60%, and the game is not a national TV game. That expected attendance is about 15,003.
SE: The standard error measures the uncertainty of each coefficient estimate, which allows us to construct a marginal confidence interval for that coefficient. The estimates for home team win percentage, conference, opponent win percentage, national TV, and every weekday level except Sunday lie many standard errors from zero, so these terms have a clear effect on attendance. The Sunday estimate (10.33, standard error 48.72, p = 0.832) is not distinguishable from zero.
R^2: 84.15% of the variation in attendance (adjusted: 84.14%) is explained by weekday, win percentage, conference, opponent win percentage, and whether the game is on national TV.
F-test: Under the null hypothesis that all slope coefficients are zero, the F statistic follows an F distribution with 12 and 9987 degrees of freedom; the observed value is 4420, with a p-value smaller than 2.2e-16. The ANOVA comparison of Model 1 and Model 2 also gives a p-value smaller than 2.2e-16, so the additional variables improve the model's fit.
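The ANOVA table comparing Model 1 and Model 2 reports a chi-square p-value. As a cross-check, the equivalent partial F statistic can be recomputed from the residual sums of squares and residual degrees of freedom printed in that table. This is a sketch using the rounded values as printed; the variable names are illustrative.

```r
# Values copied from the "Model 1 vs Model 2" ANOVA table.
rss_small <- 4.2950e10  # residual sum of squares, Model 1
df_small  <- 9990       # residual degrees of freedom, Model 1
rss_large <- 1.7246e10  # residual sum of squares, Model 2
df_large  <- 9987       # residual degrees of freedom, Model 2

# Partial F: drop in RSS per added parameter, relative to the
# larger model's residual variance.
extra_params <- df_small - df_large
f_stat <- ((rss_small - rss_large) / extra_params) / (rss_large / df_large)

# Upper-tail probability under the null that the added terms are all zero.
p_value <- pf(f_stat, extra_params, df_large, lower.tail = FALSE)

print(round(f_stat))
print(p_value)
```

The statistic comes out in the thousands on 3 and 9987 degrees of freedom, which is consistent with the p-value below 2.2e-16 in the table.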
Lastly, the analysis adds the remaining variables, game time and the opponent's number of all-stars, to build the final regression model and examine its fit. The final results are shown below.
## Analysis of Variance Table
##
## Model 1: attendance ~ weekday + conf + win_p + opp_win_p + national_tv
## Model 2: attendance ~ weekday + conf + win_p + opp_win_p + national_tv +
## opp_all_stars + time_fct
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 9987 1.7246e+10
## 2 9984 2.4975e+09 3 1.4749e+10 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = attendance ~ weekday + conf + win_p + opp_win_p +
## national_tv + opp_all_stars + time_fct, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1829.47 -340.37 -6.63 338.22 1777.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12050.78 33.56 359.124 <2e-16 ***
## weekdayM -1030.89 18.49 -55.762 <2e-16 ***
## weekdaySa 975.06 18.58 52.488 <2e-16 ***
## weekdaySu -43.49 18.55 -2.345 0.0191 *
## weekdayTh -1034.07 18.68 -55.357 <2e-16 ***
## weekdayTu -1018.02 18.62 -54.666 <2e-16 ***
## weekdayW -1021.43 18.35 -55.652 <2e-16 ***
## confNon-Conference -3004.11 10.63 -282.660 <2e-16 ***
## confRivalry 5008.36 17.81 281.266 <2e-16 ***
## win_p 1477.58 49.42 29.897 <2e-16 ***
## opp_win_pLess than 40% -1997.49 12.07 -165.437 <2e-16 ***
## opp_win_pMore than 60% 1990.24 12.11 164.368 <2e-16 ***
## national_tvYes -502.87 10.01 -50.246 <2e-16 ***
## opp_all_stars 992.96 5.93 167.442 <2e-16 ***
## time_fct17:00:00 2003.08 17.79 112.569 <2e-16 ***
## time_fct19:00:00 3000.72 17.45 171.929 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 500.2 on 9984 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.977
## F-statistic: 2.834e+04 on 15 and 9984 DF, p-value: < 2.2e-16
Results:
Intercept: The intercept is the expected attendance for the reference case: a Friday game at 14:00 within the same conference when the home team win percentage is zero, the opponent win percentage is between 40% and 60%, the game is not a national TV game, and the opponent has zero all-stars. That expected attendance is about 12,051.
SE: The standard error measures the uncertainty of each coefficient estimate, which allows us to construct a marginal confidence interval for that coefficient. The estimates for home team win percentage, conference, opponent win percentage, national TV, game time, the number of all-stars, and every weekday level except Sunday lie many standard errors from zero, so these terms have a clear effect on attendance. The Sunday estimate (-43.49, p = 0.0191) is significant at the 5% level but small compared with the other weekday effects.
R^2: 97.71% of the variation in attendance (adjusted: 97.7%) is explained by weekday, win percentage, conference, opponent win percentage, whether the game is on national TV, game time, and the number of all-stars.
F-test: Under the null hypothesis that all slope coefficients are zero, the F statistic follows an F distribution with 15 and 9984 degrees of freedom; the observed value is 2.834e+04, with a p-value smaller than 2.2e-16. The ANOVA comparison of Model 1 and Model 2 in this step also gives a p-value smaller than 2.2e-16, so the additional variables improve the model's fit.
Summary: The adjusted R-squared rises from 0.605 to 0.8414 to 0.977 across the three models, while the residual standard error falls from 2073 to 1314 to 500.2. Each added group of variables improves the fit substantially.
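The R-squared values of the three models can be reproduced from numbers already printed in the report, using R^2 = 1 - RSS / TSS. In the sketch below, the total sum of squares is taken as the sum of the Sum Sq column in the first ANOVA table, and each RSS comes from the corresponding ANOVA output. The inputs are rounded as printed, so the results should match the reported Multiple R-squared values only to about three decimal places.

```r
# Total sum of squares: the Sum Sq column of anova(fit1)
# (weekday, conf, win_p, residuals).
tss <- sum(c(5.2772e9, 6.0391e10, 2.2149e8, 4.2950e10))

# Residual sum of squares for each model, copied from the ANOVA tables.
rss <- c(model1 = 4.2950e10, model2 = 1.7246e10, model3 = 2.4975e9)

# Share of the variation in attendance explained by each model.
r_squared <- 1 - rss / tss
print(round(r_squared, 3))
```

Since all three models share the same total sum of squares, the improvement in fit is driven entirely by the drop in the residual sum of squares.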